Data is Everywhere: Where to Look and How to Plot it

Kieran Hunt

3 February 2016

Where to Look

Data is everywhere

How to Plot it

Napolean’s Invasion of Russia

Charles Minard’s Map of Napolean’s Invasion of Russia:

Charles Minard’s Map of Napolean’s Invasion of Russia

2 easy ways to plot data (you’ll never believe number 2!)

  • We’re all a bunch of millennials. We won’t read anything that hasn’t been turned into a list with a click-baity title.
  • I’ve got a dataset of our generation. Buzzfeed!

What the data looks like

  • A CSV of 15 101 Buzzfeed listicles.
  • With the columns: “title”, “listicle_size”, “num_fb_shares”, and “url”
  • e.g.: “6 Reasons To Fall In Love With Maggie Stiefvaters Raven Cycle”,6,2276,“[truncated url]”
  • Available here.

What we’ll need

  • A copy of the R programming language (Your package manager should have this)
  • The following packages (CRAN should have these):
    • ggplot2
    • RColorBrewer
    • scales

Grab those dependencies:

install.packages(c("ggplot2", "RColorBrewer", "scales"),
    repos='http://r.adu.org.za/')
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)

Read in the data

R makes this very easy for us. Just a single line and we can work with the data.

df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)

That line turns the csv into a data frame - sort of like a table in R. You can even just put a hyperlink in there and R will download the file. header=T tells R that the first line is the header.

Always try to be answering a question

  • It helps to ask a question and then try to answer it by plotting some data.
  • Our questions:
  • What is the average length of a listicle that Buzzfeed publishes?
  • Does the length of a listicle affect how popular it is on social media?

What is the distribution of listicle lengths in that Buzzfeed publishes?

library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size)) + geom_histogram(binwidth=1)
plot

We pass in the dataframe to ggplot. We then specify the aesthetics, in this case listicle_size is a column in the dataframe (R knows that from the headings in the CSV) and ggplot works out that we want this on the x-axis.

Result

Let’s make that look a bit better

  • That graph looked pretty good but we can do better.
  • I’ve got a nice little function that acts like a theme for a ggplot graph. You can get it here.
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
source("fte-theme.R")
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size))
    + geom_histogram(binwidth=1)
    + fte_theme()
plot

Result

We’re getting there

So that looked a bit better. But I think we can still add a bit more.

Let’s give it some axis titles, a nice heading, and fit a few more breaks along the x and y axes. We’ll also add a touch of color and transparency. I’ve omitted some of the imports for brevity.

df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size))
    + geom_histogram(binwidth=1, fill="#bc70e7", alpha=0.75)
    + fte_theme()
    + labs(title="Distribution of Listicle Sizes for BuzzFeed Listicles",
    x="# of Entries in Listicle",
    y="# of Listicles")
    + scale_x_continuous(breaks=seq(0,50, by=5))
    + scale_y_continuous(labels=comma)
    + geom_hline(yintercept=0, size=0.4, color="black")
plot

Result

Result

That looks a bit iffy

You may be able to understand that graph but it has one big issue. There are a few listicles in the dataset with over 1 000 000 shares and that forces the rest of the points really close to the bottom. But we can fix this. With a log scale!

df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
      geom_point(alpha=0.05) +
      scale_y_log10(labels=comma)
plot

Also note that I’ve given each point an alpha value of 0.05. This means that each point is only 5% opaque (or 95% transparent). This serves to enhance places in the plot where many points are congregated.

Result

But can it look better?

Of course! Add our theme again and some axis titles.

df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
    geom_point(alpha=0.05) +
    scale_y_log10(labels=comma) +
    fte_theme() +
    labs(x="# of Entries in Listicle",
    y="# of Facebook Shares",
    title="FB Shares vs. Listicle Size for BuzzFeed Listicles")
plot

Result

Let’s add a few final touches

df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
    geom_point(alpha=0.05, color="#bc70e7") +
    scale_x_continuous(breaks=seq(0,50, by=5)) +
    scale_y_log10(labels=comma, breaks=10^(0:6)) +
    scale_y_log10(labels=comma) +
    geom_hline(yintercept=1, size=0.4, color="black") +
    geom_smooth(alpha=0.25, color="black", fill="black") +
    fte_theme() +
    labs(x="# of Entries in Listicle",
    y="# of Facebook Shares",
    title="FB Shares vs. Listicle Size for BuzzFeed Listicles")
plot

We’ve also added a line of best fit (with confidence).

Result